fix(hot): split history shards by encoded size, not count #66
Closes ENG-2287

`AccountsHistory` and `StorageHistory` are stored DUPSORT, so each dup value (key2 || encoded BlockNumberList) is capped at MDBX's DUPSORT value limit (~1980 B on 4 KB pages). The previous splitter gated on ShardedKey::SHARD_COUNT (2000), but a roaring BlockNumberList of 2000 sparse indices serialises to >20 KB. Once a hot account's shard tipped over, every live block touching it crashed the sidecar with MDBX_BAD_VALSIZE.

Replace count-based splitting in append_to_sharded_history with a size-based check against a new ShardedKey::MAX_SHARD_BYTES (1500). When the merged list overflows the budget, binary-search the largest prefix that fits and emit shards until the remainder is consumed. Existing oversized shards self-heal on the next write to them, so no migration is required.

Add test_history_shard_fits_in_dupsort_limit to the hot conformance suite. It exercises the real update_history_indices_inconsistent path with 2000 sparse blocks (one per roaring container, the worst-case encoding) and asserts the union across shards round-trips.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
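The prefix search described above can be sketched in isolation. This is a minimal illustration, not the PR's code: `encoded_size` is a hypothetical size model (fixed header, a per-container cost keyed by `idx >> 16`, two bytes per index) standing in for the real roaring serialisation, and `largest_fitting_prefix` and `split_by_size` are names invented for this example.

```rust
/// Toy stand-in for the encoded size of a block-number list (an assumption,
/// not the real roaring format): a fixed header, a per-container cost for
/// each distinct `idx >> 16`, and two bytes per index.
fn encoded_size(sorted: &[u64]) -> usize {
    let mut containers = 0usize;
    let mut last = None;
    for &idx in sorted {
        let key = idx >> 16;
        if last != Some(key) {
            containers += 1;
            last = Some(key);
        }
    }
    8 + containers * 8 + sorted.len() * 2
}

/// Length of the largest prefix of `sorted` whose encoded size fits `budget`.
/// Relies on `encoded_size` never shrinking as the prefix grows.
fn largest_fitting_prefix(sorted: &[u64], budget: usize) -> usize {
    let (mut lo, mut hi) = (0usize, sorted.len());
    while lo < hi {
        let mid = lo + (hi - lo + 1) / 2; // round up so the loop terminates
        if encoded_size(&sorted[..mid]) <= budget {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    lo
}

/// Split `sorted` into consecutive shards that each fit `budget`.
fn split_by_size(sorted: &[u64], budget: usize) -> Vec<Vec<u64>> {
    let mut shards = Vec::new();
    let mut rest = sorted;
    while !rest.is_empty() {
        // Always take at least one index so the loop makes progress.
        let n = largest_fitting_prefix(rest, budget).max(1);
        shards.push(rest[..n].to_vec());
        rest = &rest[n..];
    }
    shards
}

fn main() {
    // 2000 sparse indices, one per 2^16-wide container.
    let sparse: Vec<u64> = (0..2000u64).map(|i| i << 16).collect();
    let shards = split_by_size(&sparse, 1500);
    assert!(shards.iter().all(|s| encoded_size(s) <= 1500));
    assert_eq!(shards.concat(), sparse);
    println!("{} shards", shards.len());
}
```

Each shard costs one binary search, and every probe re-encodes a prefix; that repeated encoding is the cost the later review comments object to.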
cf04073 to
c512dff
Compare
/// Soft cap on the number of indices in one shard.
///
/// This is a sanity ceiling used alongside [`Self::MAX_SHARD_BYTES`];
/// shard splitting is driven by encoded size, not by this count.
Another thing we should consider is whether we can empirically find size bounds with best and worst case inputs
prestwich
left a comment
there are 3 cases when writing:
1. exceeds `MAX_SHARD_BYTES`
2. exceeds `MAX_SHARD_BYTES` and the MDBX dupsort max
3. (boring case) does not exceed `MAX_SHARD_BYTES`
previously:
- case 2 would panic during writing
- case 1 & 3 would work
there are no changes to the decoding, so anything written with previous versions can still decode. Different versions would chunk history differently, and therefore have slightly different real-world perf, but there should not be any version compatibility issues, as the history chunking is not consensus critical
1. How does the treemap encode the indices?
2. Can we empirically determine a safe number of indices based on the encoding so that we don't have to do a binary search?
3. In pursuit of 2, let's add tests to the `storage-types` package with worst-case indices.
4. Let's promote the `SHARD_COUNT` to a type-associated constant on our `BlockNumberList`
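As a starting point for the worst-case tests, the two extreme inputs can be generated without any storage code. This sketch assumes the standard roaring layout, in which values are grouped into containers by their upper bits (`idx >> 16` here); `container_count`, `dense_indices`, and `sparse_indices` are hypothetical helper names, not part of the codebase.

```rust
/// Number of distinct 2^16-wide containers touched by a sorted index list.
fn container_count(sorted: &[u64]) -> usize {
    let mut keys: Vec<u64> = sorted.iter().map(|idx| idx >> 16).collect();
    keys.dedup(); // input is sorted, so equal keys are adjacent
    keys.len()
}

/// Best case: `n` consecutive block numbers share as few containers as possible.
fn dense_indices(n: u64) -> Vec<u64> {
    (0..n).collect()
}

/// Worst case: every block number lands in its own container.
fn sparse_indices(n: u64) -> Vec<u64> {
    (0..n).map(|i| i << 16).collect()
}

fn main() {
    let n = 2000;
    println!("dense:  {} containers", container_count(&dense_indices(n)));
    println!("sparse: {} containers", container_count(&sparse_indices(n)));
}
```

Feeding both inputs to the real encoder and recording the serialised sizes would give the empirical bounds asked for in point 2.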
[Claude Code] Posting on James's behalf: follow-up to the size-based-splitting discussion.

Agree the size-based approach is the right call (the count-based alternative imposes ~10× metadata overhead on hot, dense addresses, which dominate real workloads). The objection is to the binary search, not the size budget.

Why the current splitter is more expensive than it needs to be

Also, the

Three cheaper heuristics

A. Incremental insert + exact size check (simplest). Insert indices one at a time into a working
B. Boundary-triggered estimate (O(n)).
C. Segment-based greedy pack (O(n), exact). Pre-segment the merged input by

Recommendation: Option A

Smallest delta from the current PR. Replace

    let mut shard_start = 0;
    let mut shard = BlockNumberList::default();
    for (i, &idx) in all.iter().enumerate() {
        shard.push(idx)?;
        if shard.serialized_size() > ShardedKey::MAX_SHARD_BYTES {
            // emit prefix [shard_start..i); rebuild once at the boundary
            let prefix = BlockNumberList::new_pre_sorted(all[shard_start..i].iter().copied());
            write_shard(all[i - 1], &prefix)?;
            shard_start = i;
            shard = BlockNumberList::default();
            shard.push(idx)?;
        }
    }
    write_shard(u64::MAX, &shard)?;

One roaring rebuild per shard boundary instead of

Orthogonal asks from your earlier review still stand

The four points from the change-request: characterise roaring encoding, add worst-case tests in
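To try the Option A loop outside the storage crate, the following self-contained sketch replaces `BlockNumberList` with a hypothetical `ToyList` whose size model (8-byte header, 8 bytes per `idx >> 16` container, 2 bytes per index) is an assumption, not the real roaring encoding. `split_incremental` is likewise an invented name; it returns the shards instead of calling `write_shard`.

```rust
/// Hypothetical stand-in for `BlockNumberList`; the size model is an assumption.
#[derive(Default, Clone, Debug, PartialEq)]
struct ToyList {
    items: Vec<u64>,
    containers: usize,
}

impl ToyList {
    /// Append an index; callers must push in ascending order.
    fn push(&mut self, idx: u64) {
        if self.items.last().map(|l| l >> 16) != Some(idx >> 16) {
            self.containers += 1;
        }
        self.items.push(idx);
    }

    fn serialized_size(&self) -> usize {
        8 + self.containers * 8 + self.items.len() * 2
    }
}

/// Greedy incremental split. Returns (shard key, shard) pairs, where the key is
/// the highest index in the shard and the final shard is keyed by `u64::MAX`.
fn split_incremental(all: &[u64], max_bytes: usize) -> Vec<(u64, ToyList)> {
    let mut out = Vec::new();
    let mut shard = ToyList::default();
    for &idx in all {
        shard.push(idx);
        if shard.serialized_size() > max_bytes && shard.items.len() > 1 {
            // Overflow: rebuild the shard without `idx`, emit it, start afresh.
            let mut prefix = ToyList::default();
            for &p in &shard.items[..shard.items.len() - 1] {
                prefix.push(p);
            }
            let key = *prefix.items.last().expect("prefix is non-empty");
            out.push((key, prefix));
            shard = ToyList::default();
            shard.push(idx);
        }
    }
    out.push((u64::MAX, shard));
    out
}

fn main() {
    let sparse: Vec<u64> = (0..2000u64).map(|i| i << 16).collect();
    let shards = split_incremental(&sparse, 1500);
    assert!(shards.iter().all(|(_, s)| s.serialized_size() <= 1500));
    let joined: Vec<u64> = shards.iter().flat_map(|(_, s)| s.items.clone()).collect();
    assert_eq!(joined, sparse);
    println!("{} shards", shards.len());
}
```

The loop does one size check per index and one rebuild per emitted shard, which matches the cost profile the comment argues for.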
[Claude Code] Follow-up posted on James's behalf — layering observation, separable from the splitting-heuristic discussion above.
The dual-keyed table abstraction itself is fine; that's a deliberate primitive in this storage layer, not an MDBX-ism. The issue is purely that the byte budget is backend-specific while the splitting policy is generic. Minimal cleanup that fixes the layering without touching the trait surface:

    trait HotKvWrite {
        /// Maximum encoded size of a single stored value, in bytes.
        /// `None` means unbounded.
        fn max_value_bytes(&self) -> Option<usize> { None }
    }

This keeps the shared splitter (so we don't duplicate the algorithm per backend), keeps the schema shared (dual-keyed history tables remain the abstraction's primitive), and removes the MDBX-specific constant from

Happy to file this as its own issue and let ENG-2287 land first, or fold it into the same PR if it's small enough. Probably the former: different scope, easier to review separately.
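A rough sketch of how a shared splitter could consume such a hook follows. Everything beyond the `max_value_bytes` signature is hypothetical: the `MdbxLike` and `InMemory` backends, the illustrative 1500-byte limit, the `split_for` function, and the flat 8-bytes-per-index size model are assumptions made for this example only.

```rust
/// Write trait carrying the backend's value-size limit (as proposed above).
trait HotKvWrite {
    /// Maximum encoded size of a single stored value, in bytes.
    /// `None` means unbounded.
    fn max_value_bytes(&self) -> Option<usize> {
        None
    }
}

/// Hypothetical backend with a bounded value size.
struct MdbxLike;
/// Hypothetical backend with no value-size limit.
struct InMemory;

impl HotKvWrite for MdbxLike {
    fn max_value_bytes(&self) -> Option<usize> {
        Some(1500) // illustrative budget, not a real backend constant
    }
}

impl HotKvWrite for InMemory {}

/// Shared splitter: asks the backend for its budget instead of using a
/// hard-coded constant. Assumes a flat cost of 8 bytes per index.
fn split_for<B: HotKvWrite>(backend: &B, sorted: &[u64]) -> Vec<Vec<u64>> {
    match backend.max_value_bytes() {
        // Unbounded backend: keep everything in a single shard.
        None => vec![sorted.to_vec()],
        // Bounded backend: chunk so each shard stays within the limit.
        Some(limit) => {
            let per_shard = (limit / 8).max(1);
            sorted.chunks(per_shard).map(|c| c.to_vec()).collect()
        }
    }
}

fn main() {
    let indices: Vec<u64> = (0..1000).collect();
    println!("bounded:   {} shards", split_for(&MdbxLike, &indices).len());
    println!("unbounded: {} shards", split_for(&InMemory, &indices).len());
}
```

The splitting policy stays in one place, and only the budget varies by backend.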

Closes ENG-2287
AccountsHistoryandStorageHistoryare stored DUPSORT, so each dupvalue (key2 || encoded BlockNumberList) is capped at MDBX's DUPSORT
value limit (~1980 B on 4 KB pages). The previous splitter gated on
ShardedKey::SHARD_COUNT (2000), but a roaring BlockNumberList of 2000
sparse indices serialises to >20 KB. Once a hot account's shard tipped
over, every live block touching it crashed the sidecar with
MDBX_BAD_VALSIZE.
Replace count-based splitting in append_to_sharded_history with a
size-based check against a new ShardedKey::MAX_SHARD_BYTES (1500). When
the merged list overflows the budget, binary-search the largest prefix
that fits and emit shards until the remainder is consumed. Existing
oversized shards self-heal on the next write to them, so no migration
is required.
Add test_history_shard_fits_in_dupsort_limit to the hot conformance
suite. It exercises the real update_history_indices_inconsistent path
with 2000 sparse blocks (one per roaring container, the worst-case
encoding) and asserts the union across shards round-trips.
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com